Computer-Using Agent

https://openai.com/index/computer-using-agent/

Powering Operator is Computer-Using Agent (CUA), a model that combines GPT‑4o's vision capabilities with advanced reasoning through reinforcement learning.

CUA is trained to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields people see on a screen—just as humans do.

It operates GUIs the way a human does (conversely, it does not use APIs)

👉schroneko/systemprompts chatgpt_operator_2025-02-22.txt

While CUA is still early and has limitations, it sets new state-of-the-art benchmark results, achieving a 38.1% success rate on OSWorld for full computer use tasks, and 58.1% on WebArena and 87% on WebVoyager for web-based tasks.

How it works

Input to CUA

The task is given as text

A screenshot is also provided

CUA generates an action (as text, as I understand it)

The action is applied to a virtual machine, which yields the next screenshot

That screenshot is fed back in together with the task (text)

https://images.ctfassets.net/kftzwdyauwt9/66EMZgoHtZBCDjjTZiWodY/e0472c662472755fa576522ce12a457d/Infographic_Transparent__Mobile_.png?w=1200&q=70&fm=webp

CUA processes raw pixel data to understand what’s happening on the screen and uses a virtual mouse and keyboard to complete actions.

It has a virtual mouse and keyboard

Given a user’s instruction, CUA operates through an iterative loop that integrates perception, reasoning, and action.
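To make the loop in this section concrete, here is a minimal Python sketch of the cycle: take a screenshot, choose an action, apply it to the virtual machine, and repeat. Everything in it is an assumption made for illustration: `Action`, `FakeVM`, `run_loop`, and `scripted_policy` are hypothetical names, the "screenshot" is a plain string rather than raw pixels, and the toy policy stands in for the model. It is not CUA's actual interface or action format, which the source does not describe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record; the real action format is not given in the source."""
    kind: str        # "click", "type", or "done"
    text: str = ""   # text to type, if any
    x: int = 0       # pointer coordinates for a click
    y: int = 0


class FakeVM:
    """Stand-in for the virtual machine. Screenshots are strings so the sketch runs anywhere."""

    def __init__(self) -> None:
        self.typed: List[str] = []
        self.clicks = 0

    def screenshot(self) -> str:
        return f"screen(clicks={self.clicks}, typed={''.join(self.typed)!r})"

    def apply(self, action: Action) -> str:
        # Virtual mouse and keyboard: update state, then return the next screenshot.
        if action.kind == "click":
            self.clicks += 1
        elif action.kind == "type":
            self.typed.append(action.text)
        return self.screenshot()


# A policy maps (task text, current screenshot, past actions) to the next action.
Policy = Callable[[str, str, List[Action]], Action]


def run_loop(task: str, vm: FakeVM, policy: Policy, max_steps: int = 10) -> List[Action]:
    history: List[Action] = []
    screenshot = vm.screenshot()                    # perception: initial screen
    for _ in range(max_steps):
        action = policy(task, screenshot, history)  # reasoning: pick the next action
        history.append(action)
        if action.kind == "done":
            break
        screenshot = vm.apply(action)               # action: apply it, get the next screenshot
    return history


def scripted_policy(task: str, screenshot: str, history: List[Action]) -> Action:
    """Toy policy standing in for the model: click, type the task text, then stop."""
    script = [Action("click", x=10, y=20), Action("type", text=task), Action("done")]
    return script[min(len(history), len(script) - 1)]


vm = FakeVM()
steps = run_loop("hello", vm, scripted_policy)
print([a.kind for a in steps])
print(vm.screenshot())
```

The `max_steps` cap is a design choice of this sketch, so that a policy that never emits `done` still terminates.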

Evaluations

Benchmarks for browser use

Benchmarks for computer use

Not read yet (tsundoku)